DOCUMENT SIMILARITY DETECTION
AND CLASSIFICATION SYSTEM 

Abstract 

A document similarity detection and classification system is
presented. The system employs a case-based method of classifying
electronically distributed documents in which content chunks of 
an unclassified document are compared to the sets of content 
chunks comprising each of a set of previously classified sample 
documents in order to determine a highest level of resemblance 
between an unclassified document and any of a set of previously 
classified documents. The sample documents have been manually 
reviewed and annotated to distinguish document classifications 
and to distinguish significant content chunks from insignificant 
content chunks. These annotations are used in the similarity
comparison process. If a significant resemblance level exceeding a
predetermined threshold is detected, the classification of the most
significantly resembling sample document is assigned to the
unclassified document. Sample documents may be acquired to build
and maintain a repository of sample documents by detecting
unclassified documents that are similar to other unclassified
documents and subjecting at least some similar documents to a
manual review and classification process. In a preferred embodiment
the invention may be used to classify email messages in support 
of a message filtering or classification objective. 
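
The classification step described above can be illustrated with a minimal sketch. It is not the claimed implementation; every name in it (`Sample`, `chunk`, `resemblance`, `classify`) is hypothetical. It assumes that a content chunk is one whitespace-normalized, lowercased line, that resemblance is the fraction of a sample's significant chunks found in the unclassified document, and that the threshold is 0.5. The abstract fixes none of these choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """A manually reviewed sample document (hypothetical structure)."""
    label: str                # classification assigned during manual review
    significant: frozenset    # chunks annotated as significant
    insignificant: frozenset  # chunks annotated as insignificant; ignored when scoring


def chunk(text):
    """Split text into content chunks.

    Assumption: one chunk per non-empty line, lowercased, with runs of
    whitespace collapsed to a single space.
    """
    return frozenset(
        " ".join(line.lower().split())
        for line in text.splitlines()
        if line.strip()
    )


def resemblance(doc_chunks, sample):
    """Fraction of the sample's significant chunks that appear in the document."""
    if not sample.significant:
        return 0.0
    return len(doc_chunks & sample.significant) / len(sample.significant)


def classify(text, samples, threshold=0.5):
    """Return the label of the most resembling sample, or None.

    None is returned when no sample's resemblance exceeds the threshold,
    so the document stays unclassified.
    """
    doc_chunks = chunk(text)
    best_label, best_score = None, 0.0
    for sample in samples:
        score = resemblance(doc_chunks, sample)
        if score > best_score:
            best_label, best_score = sample.label, score
    return best_label if best_score > threshold else None
```

In this sketch, a message that contains both significant chunks of a sample labeled "spam" receives that label, while a message that shares only an insignificant chunk (for example, a common greeting line) remains unclassified.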



